User-Level Socket-Based Checkpointing for Distributed and Parallel Computation

Authors

  • Jason Ansel
  • Michael Rieker
  • Gene Cooperman
Abstract

We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically created POSIX threads and forked child processes. It also includes checkpointing of remotely spawned processes via ssh and other mechanisms. As with all user-level checkpointing, no modification of the kernel is needed, and the application code is not modified. The package also checkpoints signal handlers, ordinary file descriptors, socket descriptors, and certain other types of file descriptors. Each checkpointed process has an associated checkpoint file. Hence, process migration, and even migration of an entire computation to a new cluster, are achieved through the simple expedient of copying checkpoint files to a new host. However, process migration adds the restriction that the source and destination hosts must be homogeneous.

Similar resources

DMTCP: Scalable User-Level Transparent Checkpointing for Cluster Computations

As the size of clusters increases, failures are becoming increasingly frequent. Applications must become fault tolerant if they are to run for extended periods of time. We present DMTCP (Distributed MultiThreaded CheckPointing), the first user-level distributed checkpointing package not dependent on a specific message passing library. This contrasts with existing approaches either specific to l...

Application-Level Checkpointing Techniques for Parallel Programs

In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, e.g. the computation state is saved to a filesystem. The checkpoint files can then be used to resume computation upon failure of the original process(s), hopefully with minimal loss of computing work. A checkpoint can be taken using a variety of techniques in every l...

Automatic Parallel Program Checkpointing in Message-Passing Environments

Efficient use of cluster resources is very important because of the high demand for parallel computation. Checkpointing allows cluster computing time to be managed more efficiently. In this article, the problems of checkpointing parallel programs are discussed and an implementation of an automatic parallel checkpointing system for MPI programs is presented. It is based on simple user-space portable chec...

Application Level Fault Tolerance in Heterogenous Networks of Workstations

We have explored methods for checkpointing and restarting processes within the Distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implem...

Asynchronous Checkpointing for PVM Requires Message-Logging

Distributed computing using networked workstations offers cost-efficient parallel computing, but the higher rate of failure requires effective fault-tolerance. Asynchronous consistent checkpointing offers a low-overhead solution. Parallel Virtual Machine (PVM) allows a heterogeneous network of UNIX workstations to serve immediately as a distributed computer by providing message-passing services imp...

Journal:
  • CoRR

Volume abs/cs/0701037

Publication date: 2007